Gun identification from gunshot audios for secure public places using transformer learning

Increased mass shootings and terrorist activities severely impact society mentally and physically. Development of real-time and cost-effective automated weapon detection systems increases a sense of safety in public. Most of the previously proposed methods were vision-based. They visually analyze the presence of a gun in a camera frame. This research focuses on gun-type (rifle, handgun, none) detection based on the audio of its shot. Mel-frequency-based audio features have been used. We compared both convolution-based and fully self-attention-based (transformers) architectures. We found transformer architecture generalizes better on audio features. Experimental results using the proposed transformer methodology on audio clips of gunshots show classification accuracy of 93.87%, with training loss and validation loss of 0.2509 and 0.1991, respectively. Based on experiments, we are convinced that our model can effectively be used as both a standalone system and in association with visual gun-detection systems for better security.

www.nature.com/scientificreports/
Previously, researchers widely used convolutional neural network (CNN) architectures to solve computer vision problems. Although promising, CNN-based architectures have some weaknesses. First, because convolutions work with a constant window size, the model captures local information rather than long-range spatial relationships between different parts of the image and the image as a whole 3 . Second, some local information is lost through max-pooling 4 . Inspired by natural language processing (NLP), vision transformers have recently been proposed as an alternative to CNNs and have shown promising results in the field of computer vision 3 . The vision transformer is free of convolutions and treats an image as a sequence of patches, hence overcoming the issues of locality and translation invariance faced by CNNs. This approach has been observed to use hardware resources more efficiently than CNNs and can be pre-trained on the public ImageNet dataset with fewer resources 3 .
In this paper, we investigate whether different gun types can be classified from the audio of their shots. The hypothesis arises because experienced military personnel can accurately tell which gun a shot was fired from by listening to it. To test this, we created a dataset that consists of 1-s audio clips of gunshots. Transformers have proved their success from NLP to vision tasks. However, to our knowledge, they have not been used to classify gunshot audio. To replicate a human's sense of understanding of audio, we used Mel-frequency features to represent the audio and developed a transformer-based classifier to predict the class of gun.
The major challenges faced while working in this area are as follows. First, creating such a dataset is difficult and costly due to the legalities and risks involved; we therefore created a dataset using audio from YouTube videos. Second, no features had previously been proposed for attention-based gun audio classification. It is not known whether attention can be applied directly to shot audio or whether some derived feature allows attention to give a better result.
Our main contributions in this paper are the following:
1. We manually created a dataset of 1-s gunshot audio clips and made it available for public use.
2. We optimized transformer hyperparameters, showing that the vision transformer significantly improves gunshot detection accuracy compared with other state-of-the-art algorithms.
3. Our proposed approach performs consistently well across all types and aspects of the data, yielding better gunshot detection results than state-of-the-art algorithms.
The rest of the paper is organized as follows: related work; the methodology that we used and dataset creation; results and discussion; and conclusion, limitations, and future scope.

Related work
Choi et al. highlighted CCTV's significance for better police service 5 . They also studied CCTV-based intelligent security systems for constructing a crime-zero zone 6 . In 2018, they further analyzed the feasibility of security systems based on their economic value 7 . Adding an audio-based gunshot detection method will extend the reach of such systems: with a nominal increase in cost, a completely new sensory ability increases their usefulness. Liang et al. 8 proposed a method that can detect the shooter's position using only a few user-generated videos that include the sound of gunshots. Liang et al. 9 in 2017 addressed the issue by developing a temporal localization framework for intensive audio events in videos, in which the localization results are improved using Localized Self-Paced Reranking (LSPaR). The authors of ref. 10 proposed a robust neural network-based approach for audio classification. They employed an encoder that contains a sequential stack of neural networks and used a temporal interpolation approach to compute scene-level embeddings. Banoorupa et al. 11 proposed a hybrid fingerprinting-based approach for audio classification using LSTM. In this approach, the fingerprints were created from the MFCC spectrum, and the resulting spectrum was then converted into digital images.
Phan et al. 12 proposed a multiclass audio classification solution for polyphonic audio event detection. They divided the event categories into multiple sets and created a multi-class problem employing a divide-and-conquer approach. Wang et al. 13 proposed a 2D CNN-based technique for audio classification. This algorithm was widely used in various speech recognition and classification tasks. Zhang et al. 14 proposed an AED module called Multi-Scale Time-Frequency Attention (MTFA). It informs the model where to focus along the time and frequency axes by collecting data at different resolutions, which had not been addressed in the past. Zhang et al. 15 and Shen et al. 16 proposed a multiscale time-frequency convolutional recurrent neural network (MTF-CRNN) for sound event detection 15 .
Deep learning has evolved considerably over time. CNNs were used earlier and continue to work very well. More recently, transformers were introduced and have achieved state-of-the-art performance in both NLP and vision tasks. In this paper, we search for the best way to classify guns based on their audio, and we ran many experiments for this task with both CNNs and transformers.
Traditional machine learning based research. This subsection discusses machine learning-based audio classification approaches; the researchers' contributions are summarized in Table 1. Gunshot audio depends on various aspects, i.e., (1) firing power (size of bullet), (2) length and width of the nozzle, and (3) environment (echo). A change in firing power or in the length or width of the gun nozzle indicates that the weapon has changed. Vrysis et al. 42 compared 1D and 2D CNNs for audio classification using various features and found that a 2D CNN with spectrogram input worked best. Nanni et al. 39 used an ensemble of CNNs to classify animal sounds using spectrograms and some handcrafted features. In our work, we have utilized Mel-frequency and spectrogram-based features to identify the audio.
Transformer as new state of the art. Transformers were first proposed by Vaswani et al. 43 for machine translation in NLP. The architecture is entirely attention-based and removes recurrence completely; it is faster to train and offers better performance. Experimenting with pairwise and patch-wise self-attention, Zhao et al. 44 found that both outperform CNNs. Dosovitskiy et al. 3 used transformers for image recognition, with a simple transformer as an encoder. The transformer architecture has proven successful in different domains. Inspired by this success, we employed transformers to classify the audio of gunshots.
Research on gun identification was carried out by Kiktova et al. 45 as an extension of an intelligent acoustic event detection system. The work extracted a variety of features, and mutual information was used for feature selection to reduce the length of the feature vectors. A Hidden Markov Model with a Viterbi-based decoding algorithm then utilized the selected features. In a 2021 publication, Dogan 46 presented work on predicting the gun model from its sound and developed an intelligent audio forensics tool. Dogan used a fractal H-tree pattern-based classification method, in which fractal and statistical features were utilized by SVM and kNN classifiers. Researchers have not only explored the classification of gunshots; research has also been done on measuring and analysing gunshot sound, as it may cause hearing impairments 47 . In that study, acoustic data was collected from four different guns, with sound captured at a sampling rate of 204.8 kHz. The method developed to measure gunshot was based on using image processing

Methodology
Proposed approach. In this section, the methodology used to classify gunshots is discussed in detail. The proposed approach has not been used earlier to classify audio and is thus considered novel. Its components are described below.
• Vision transformer. For the image classification task, a variation is made to the traditional transformer architecture used in NLP. In our approach, the input is processed by the encoder, and the output of the encoder is flattened and fed to a multilayer perceptron. No decoder is used, as shown in Fig. 2. The approach treats each patch of the image like a token of text. The transformer takes a sequence of linear vectors with positional embeddings as input. Therefore, the 2D patches of the image are first flattened into linear vectors. Then, the positional embedding and a class token are added. As shown in Fig. 3, the encoder consists of many encoder blocks, each containing multi-headed self-attention layers. The output of the encoder is normalized and sent to dense layers for image classification, as shown in Fig. 3. The model is inspired by Dosovitskiy et al. 3 .
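The patch-to-sequence step described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function name `image_to_tokens`, the toy 4 × 4 single-channel image, the 2 × 2 patch size, and the all-zero class token are assumptions made for the example, and the learned linear projection and positional embedding are deliberately omitted.

```python
def image_to_tokens(image, patch_size, class_token):
    """Split a 2D image (list of rows) into flattened patches and
    prepend a class token, mimicking the ViT input sequence.

    Hypothetical helper for illustration only; a real model would also
    apply a learned linear projection and add positional embeddings.
    """
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    tokens = [class_token]
    # Walk over the patch grid in row-major order.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            tokens.append(patch)
    return tokens


# Toy 4 x 4 "image" with 2 x 2 patches (illustrative sizes only).
toy_image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
tokens = image_to_tokens(toy_image, patch_size=2, class_token=[0, 0, 0, 0])
num_patches = len(tokens) - 1  # exclude the class token
```

The resulting sequence holds one class token followed by one flattened vector per patch, which is the form of input the encoder blocks consume.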
Pre-trained models have proven successful in many studies. After the vanishing gradient problem was identified, researchers proposed residual connections in the Resnet50 paper 49 . In our experiments, across several training runs, we found that even the Resnet50 model quickly overfits. We tried using dropout with a high rate, but a very large gap between training and validation loss remained. As shown by 50 , transformers are better at handling such situations. Among the various available vision transformer architectures, we used the L32 model for this research. The input to the transformer is a 3-channel (RGB) image of size 224 × 224. The image is divided into patches of shape 32 × 32, giving (224/32)² = 49 patches. Each patch is passed to a linear transformation layer, which converts it into a fixed-length vector. The vector of each patch is given its position information by adding a position embedding. As our problem reduces to image classification, a class descriptor token is also added to the patches. We pass the combined sequence to the transformer encoder. In the L32 model, the input passes through 24 self-attention encoder blocks for feature extraction. To suit our problem, we added two dense layers with 400 and 100 nodes, respectively. Both layers use ReLU activation. The dense layers are followed by a softmax layer, which classifies the input into three classes: handgun, rifle, and none. The L32 model we chose is pre-trained on the ImageNet dataset. We fine-tune the model keeping every layer trainable. The Adam optimizer is used while fine-tuning, with a learning rate schedule that peaks at 1 × 10⁻³ and consists of a warmup phase followed by a linear decay. Categorical cross-entropy is used as the loss function while training.
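The warmup-then-linear-decay schedule mentioned above could look like the sketch below. Only the peak value of 1 × 10⁻³ comes from the text; the warmup length, the total number of steps, the linear shape of the warmup, and the function name `learning_rate` are assumptions chosen for illustration.

```python
def learning_rate(step, peak_lr=1e-3, warmup_steps=100, total_steps=1000):
    """Linear warmup from zero to peak_lr, then linear decay back to zero.

    peak_lr follows the text; warmup_steps and total_steps are
    hypothetical values used only to make the example concrete.
    """
    if step < warmup_steps:
        # Warmup phase: ramp up proportionally to the step index.
        return peak_lr * step / warmup_steps
    # Decay phase: shrink linearly until total_steps is reached.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Sample the schedule at a few steps to see its shape.
schedule = [learning_rate(s) for s in (0, 50, 100, 550, 1000)]
```

Under these assumed settings the rate rises during the first 100 steps, reaches its peak at step 100, and falls back to zero by step 1000.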
In this paper, we have used multi-head attention instead of single-head attention, with d_model-dimensional keys, values, and queries, as used by Vaswani et al. 43 .
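To make the multi-head idea concrete, the following sketch splits each d_model-dimensional token into equal slices, runs scaled dot-product self-attention on each slice, and concatenates the results. It is a simplified illustration in plain Python rather than the model used in the paper: the learned query, key, value, and output projections are replaced by identity mappings, and the function names and toy token values are assumptions.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        outputs.append(
            [
                sum(w * v[col] for w, v in zip(weights, values))
                for col in range(len(values[0]))
            ]
        )
    return outputs


def multi_head_attention(tokens, num_heads):
    """Self-attention with num_heads heads and identity projections.

    Simplification for illustration: real transformers apply learned
    linear projections before and after the per-head attention.
    """
    d_model = len(tokens[0])
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    merged = [[] for _ in tokens]
    for head in range(num_heads):
        part = slice(head * d_head, (head + 1) * d_head)
        head_tokens = [row[part] for row in tokens]
        head_out = scaled_dot_product_attention(head_tokens, head_tokens, head_tokens)
        # Concatenate this head's output onto each token's result.
        for i, row in enumerate(head_out):
            merged[i].extend(row)
    return merged


# Three toy tokens with d_model = 4, attended with two heads.
tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
output = multi_head_attention(tokens, num_heads=2)
```

Each head attends over the whole sequence using only its own slice of the features, so different heads can weight the same tokens differently.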

Experimental setup
Dataset collection and pre-processing. The dataset used to carry out this work contains short gunshot audio clips of 1 s each. First, 200 videos containing multiple gunshots were collected from YouTube. Among these 200 videos, 101 are of "rifle" shots and 99 are of "handgun" shots. A gunshot usually lasts less than 1 s, so the remainder of each 1-s clip consists of audio taken randomly from before and after the shot.
To segment the 1-s shot audio, manual annotations were made with millisecond precision using a video player. Thereafter, the ffmpeg tool was used to crop the annotated audio segments from the videos. We kept one to several shots from each video, provided they were different. In each audio clip, the background noise padding is at a different position.
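The cropping step can be sketched as a small helper that builds an ffmpeg command for one annotated shot. This is an assumption about how the tool might have been invoked, not the authors' actual command: the file names and the helper name `ffmpeg_crop_command` are hypothetical, and the sketch relies only on the standard ffmpeg options `-ss` (seek position), `-t` (duration), `-vn` (drop video), and `-y` (overwrite output).

```python
def ffmpeg_crop_command(video_path, start_seconds, out_path, duration=1.0):
    """Build an ffmpeg argument list that extracts a short audio clip.

    Hypothetical helper: start_seconds is the annotated start of the
    clip, formatted to millisecond precision as in the annotation step.
    """
    return [
        "ffmpeg",
        "-y",                            # overwrite the output if it exists
        "-ss", f"{start_seconds:.3f}",   # seek to the annotated start time
        "-i", video_path,                # source video
        "-t", f"{duration:.3f}",         # keep 1 s of audio by default
        "-vn",                           # discard the video stream
        out_path,                        # output format inferred from extension
    ]


# Illustrative file names only; these are not taken from the dataset.
command = ffmpeg_crop_command("video_001.mp4", 12.345, "rifle_001.mp3")
```

On a machine with ffmpeg installed, the list can be passed to `subprocess.run(command, check=True)` to write the clip.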
To extract clips of random noise, i.e., audio containing no gunshot, we reused the annotations made for extracting the gunshot audio. We simply skipped the portions where gunshots were present, as these had already been marked manually to create the data for the other two classes. A total of 322 noise clips were obtained.
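One way to reuse the shot annotations for the "none" class is to lay a grid of 1-s windows over a video and keep only the windows that do not overlap any annotated shot. The sketch below illustrates that idea under stated assumptions: the source does not say how the noise windows were chosen, so the fixed grid, the function name `noise_windows`, and the example timings are all hypothetical.

```python
def noise_windows(video_duration, shot_starts, clip_length=1.0):
    """Return start times of grid-aligned clips containing no annotated shot.

    Each annotated shot is assumed to occupy the interval
    [start, start + clip_length), matching the 1-s shot clips.
    """
    starts = []
    num_windows = int(video_duration // clip_length)
    for index in range(num_windows):
        begin = index * clip_length
        end = begin + clip_length
        # Two half-open intervals overlap when each starts before the other ends.
        overlaps_shot = any(
            begin < shot + clip_length and shot < end for shot in shot_starts
        )
        if not overlaps_shot:
            starts.append(begin)
    return starts


# A 6-s video with shots annotated at 1.5 s and 4.0 s (made-up timings).
free_starts = noise_windows(6.0, [1.5, 4.0])
```

The windows starting at 1 s, 2 s, and 4 s are discarded here because they overlap an annotated shot, leaving the other three as candidate noise clips.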
Finally, all the audios were manually verified, and any unclassified audio was removed. Each audio for rifle and handgun class contains only one gunshot, which is either handgun or rifle. The dataset contains a total of 1661 audios comprising 649 handgun sample shots, 693 rifle sample shots, and 322 none or random noise in .mp3 format.
Each audio file for every class is split into set length frames in order to provide the network with sufficient and relevant data. We divided the original audio files into two categories as a first step: training samples (which constituted 70% of the original data) and test samples (which constituted 30% of the original data). This is done to prevent the network from overfitting and producing inaccurate results when tested on data that was previously used to train the network. With K = 5, we performed a K-fold cross-validation to effectively test the proposed network, as shown in Table 2.
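The 70/30 split and the 5-fold cross-validation can be sketched with the standard library alone. This is an illustrative sketch, not the authors' code: the random seed, the placeholder clip identifiers, and the choice to build the folds from the training portion are assumptions, since the text does not specify how the two procedures were combined.

```python
import random


def train_test_split(items, train_fraction=0.7, seed=0):
    """Shuffle the items reproducibly and split them into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def k_fold(items, k=5):
    """Yield (training, validation) pairs; each item is validated exactly once."""
    folds = [items[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        training = [
            item
            for index, fold in enumerate(folds)
            if index != held_out
            for item in fold
        ]
        yield training, validation


# Placeholder identifiers standing in for audio clips.
clip_ids = list(range(100))
train_ids, test_ids = train_test_split(clip_ids)
splits = list(k_fold(train_ids, k=5))
```

Because every clip appears in exactly one validation fold, averaging the five validation scores uses all of the training portion for evaluation without ever testing on data seen in that fold's training.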
Comparison with other datasets and system configuration. We test our method on two additional available datasets to confirm the efficacy of the proposed method. (1) TRECVID Gunshot Videos (TREC): we   52 , we have extracted 117 audio files that include gunshots. We run experiments on these two datasets, and Table 4 reports the comparison on testing accuracy for each dataset.
All the experiments were conducted in Google Colaboratory. The system provided by Google Colaboratory has an NVIDIA K80 GPU with 12 GB of VRAM, an Intel Xeon processor with two cores, 12 GB of RAM, and 25 GB of disk space. All the implementations were done in Python 3.

Results and discussion
This section describes the results. Table 1 shows audio classification algorithms employed in the past, with a comparison of their performance by year. Further, we compared our proposed approach with algorithms developed by other authors, as shown in Table 4. To check the generalizability of our proposed approach, we also compared the results on two available datasets: (1) TRECVID Gunshot Videos (TREC) and (2) UrbanSound Gunshot Videos (Urban). Our proposed approach outperformed the state-of-the-art methods on all three datasets, as seen in Table 4. Our proposed approach produced a testing accuracy of 89.0-90.0%. Zero-shot federated learning produced an accuracy of 83.5-86.0%, the DNN ensemble model produced an accuracy of 83.0-84.5%, and the capsule network produced an audio classification accuracy of 82.2-83.6%, as shown in Table 4.
So far, CNNs 53-55 have dominated audio classification tasks. A CNN works by applying filters to sections of the data to extract significant features and edges 56 . This allows the model to learn only the most important elements of the data rather than the fine details. In contrast, our proposed model works on the complete audio data rather than only the sections that the filters can extract (or find relevant). This is one reason why our proposed approach outperforms the state-of-the-art approaches.
We tried using raw audio signals directly to train Resnet50 as a baseline. When Resnet50 is fine-tuned on the raw audio signal, the model overfits quickly. Training accuracy at the 50th epoch is 99.47%, while validation accuracy remained just 77.78%. At earlier epochs, the validation accuracy is far poorer. The training and validation losses were 0.0471 and 1.488, respectively. We found considerable variation in training and validation loss and in classification accuracy. MFCC and Mel-spectrogram features were then tested both individually and combined. Combining them gave better classification accuracy, so we continued with the combined feature as our input.
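Both MFCC and Mel-spectrogram features rest on the mel scale, which spaces frequency bands roughly the way human hearing does. The sketch below shows that mapping and the resulting band centres. The paper does not state which mel formula or how many bands were used, so the widely used HTK-style formula, the eight bands, and the 0-8000 Hz range are assumptions made for illustration.

```python
import math


def hz_to_mel(frequency_hz):
    """Convert frequency in Hz to mels (HTK-style formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(num_bands, f_min_hz, f_max_hz):
    """Centre frequencies (Hz) of bands equally spaced on the mel scale."""
    mel_min, mel_max = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (mel_max - mel_min) / (num_bands + 1)
    return [mel_to_hz(mel_min + step * (i + 1)) for i in range(num_bands)]


# Hypothetical settings: 8 bands between 0 Hz and 8 kHz.
centers = mel_band_centers(num_bands=8, f_min_hz=0.0, f_max_hz=8000.0)
```

The centres are dense at low frequencies and spread out at high frequencies, which is why mel-based features allocate more resolution to the lower part of the spectrum.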
We fine-tuned Resnet50 on MFCC and Mel-spectrogram features. As shown in Fig. 4, the Resnet50 model still shows much more variation than the vision transformer. Its training accuracy and training time are 98.88% and 18 h, respectively. The maximum validation accuracy obtained was 93.87% for both Resnet50 and the transformer (Table 3). However, at this accuracy the vision transformer has a training loss of 0.2509 and a validation loss of 0.1991. Resnet50, on the other hand, shows wide variation in training loss (4.0 × 10⁻⁴ to 0.04) and in validation loss (0.2768 to 1.538).
The classification accuracy of the vision transformer on the testing dataset ranges between 89.0 and 90.0% across different training and testing experiments (Table 4). By comparison, the accuracy of Resnet50 ranges from 84.0 to 87.0% (Table 4). We split the dataset into training and testing sets (the terms validation and testing are used interchangeably), because the size of our dataset prevents us from dividing the available audio into separate training, testing, and validation sets. We used the training and validation data and trained the model for a fixed number of epochs. For the transformer, the best model is obtained at about 100 epochs of fine-tuning, whereas Resnet50 starts to overfit quickly beyond approximately 50 epochs.
We trained and validated the models multiple times and performed 5-fold cross-validation, as shown in Table 2. Interestingly, the vision transformer, which is reputed to overfit quickly, did not overfit when the MFCC+MelSpec feature was passed in the form of an image, but it did overfit when raw audio was given as input. Resnet50 worked well with raw audio but overfitted when the MFCC+MelSpec features were passed as images.
While training on both raw audio and features, we observed that Resnet50 and the vision transformer (VT) learned their own features. The recording devices and the environments (echo) differed across clips, and background noise was present.

Vision transformer versus CNN. Vision transformers have shown remarkable performance in several computer vision-based tasks. These architectures work on multi-head self-attention mechanisms that can accept a sequence of image patches to encode contextual cues.
We are intrigued by the fundamental differences in the operation of convolution and self-attention, which have not been extensively explored in the context of robustness and generalization. It is known that convolutions learn local interactions between elements in the input domain. In contrast, self-attention has been shown to learn global interactions effectively, for example, relationships between far-off objects 57,58 . Given a query embedding, self-attention finds interactions with the other embeddings in the sequence, thereby conditioning the local 59 . In contrast, convolutions are content-independent, as the same filter weights are applied to all inputs regardless of their distinct nature. In this paper, our analysis illustrates that the vision transformer can adjust its receptive field in order to cope with noise in the data and improve the expressivity of the representations.

Table 4. Comparison of our proposed approach with state-of-the-art algorithms (testing accuracy range) on different available datasets.

Conclusion
This paper examined a vision transformer-based approach for audio-based gun-type identification. Various features, such as MFCC and Mel-spectrogram, were experimented with, as previous research suggested. The vision transformer was found to work better in terms of closeness of training and validation loss, thus giving us a better-fitting model. Results indicate that, although only a coarse gun audio classification is done in this paper, these techniques can be employed to classify various handguns and rifles based on their shot audio. Collecting the dataset for such a project is very difficult, as it raises both legal and financial issues. However, such projects are necessary.
Although our dataset captured gunshot audio in different environments, the audio filters plausibly applied to the videos would have disturbed the original audio signal. Due to such disturbance, some critical information is missing from the audio. We therefore limited the research to classifying gun types only. Had the audio been recorded using the same device, with no audio filter, and in various environments, we could have classified different handguns and rifles based on their shot audio. Some attackers use suppressors; audio from such shots is also classifiable.
To extend the audio-based gun identification task, the first step will be to collect raw gunshot audio. For each gun among various types of handguns and rifles, with and without suppressors, multiple shots in different environments must be collected. As mentioned in the limitations above, this step requires the support of legal authorities as well as monetary support.
After dataset collection, we can train a model that classifies different types of guns based on the audio of their shots. An audio input device can then be attached to CCTV cameras so that any gunshot is recognized. In this way, such situations can be attended to quickly, bypassing human intervention, which usually delays the response and allows the damage to intensify.